79 research outputs found

Evaluation of contextual embeddings on less-resourced languages

Author: Armendariz CS
Pollak S
Purver M
Repar A
Robnik-Šikonja M
Ulčar M
Žagar A
Publication date: 01/01/2021

The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysis shows that monolingual BERT models generally dominate, with a few exceptions such as the dependency parsing task, where they are not competitive with ELMo models trained on large corpora. In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models.

Queen Mary Research Online

CoSimLex : A Resource for Evaluating Graded Word Similarity in Context

Author: Granroth-Wilding M
Ljubešić N
Pollak S
Purver M
Robnik-Šikonja M
Santos Armendariz C
Ulčar M
Vaik K
Publication venue: Language Resources and Evaluation Conference, EUROPEAN LANGUAGE RESOURCES ASSOC-ELRA
Publication date: 01/01/2020

State-of-the-art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous measures of meaning similarity. This paper describes an effort to build a new dataset, CoSimLex, intended to fill this gap. Building on the standard pairwise similarity task of SimLex-999, it provides context-dependent similarity measures; covers not only discrete differences in word sense but more subtle, graded changes in meaning; and covers not only a well-resourced language (English) but a number of less-resourced languages. We define the task and evaluation metrics, outline the dataset collection methodology, and describe the status of the dataset so far.

arXiv.org e-Print Archive

Common Language Resources and Technology Infrastructure - Slovenia

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Helsingin yliopiston digitaalinen arkisto

Queen Mary Research Online

EMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions

Author: Boggia M
Boros E
Cabrera-Diego LA
Doucet A
Freiental L
Koloski B
Kranjc J
Krustok I
Lavrač N
Leppänen L
Linden C-G
Martinc M
Moreno J
Paju T
Pelicon A
Podpečan V
Pollak S
Pranjić M
Purver M
Robnik-Šikonja M
Salmela S
Sheehan S
Shekhar R
Toivonen H
Traat S
Ulčar M
Zosa E
Škrlj B
Žnidaršič M
Publication venue: EACL workshop on News Media Content Analysis and Automated Report Generation
Publication date: 19/04/2021

Queen Mary Research Online

Interpretation of microbiota-based diagnostics by explaining individual classifier decisions

Author: A. E. Budding
A. Eck
AE Budding
AS Day
B Gaonkar
C Casen
C Manichanh
D Baehrens
D Gevers
D Knights
E Bellaguarda
E Štrumbelj
E. F. J. de Groot
EK Costello
EK Wright
I Kononenko
I Sekirov
JA Sanford
L Daniels
L. M. Zintgraf
M Robnik-Šikonja
M. Welling
N Rolhion
P. H. M. Savelkoul
R Tibshirani
T. G. J. de Meij
T. S. Cohen
TG de Meij
Y Luo
Publication venue: 'Springer Science and Business Media LLC'

Crossref

A P2P Botnet detection scheme based on decision tree and adaptive multilayer neural networks

Author: A Dries
A Nigrin
A Shiravi
AK Jain
C Ludl
C-F Tsai
G Fedynyshyn
H Jiang
H Li
H Nguyen
IH Witten
J Felix
J Zhang
K Wang
K-S Han
L Breiman
Li Zhang
M Hall
M Robnik-Šikonja
M. A. Hossain
MA Razi
Mohammad Alauthaman
Nauman Aslam
P Putten Van der
P Wang
P-N Tan
R Babak
RA Rodríguez-Gómez
Rafe Alasem
S Shin
SRSC Silva
T Holz
T Zhang
W Lu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

In recent years, Botnets have been adopted as a popular method to carry and spread malicious code on the Internet. This malicious code paves the way for many fraudulent activities, including spam mail, distributed denial-of-service attacks and click fraud. While many Botnets are set up using a centralized communication architecture, peer-to-peer (P2P) Botnets can adopt a decentralized architecture using an overlay network for exchanging command and control data, making their detection even more difficult. This work presents a method of P2P Bot detection based on an adaptive multilayer feed-forward neural network in cooperation with decision trees. A classification and regression tree is applied as a feature selection technique to select relevant features. With these features, a multilayer feed-forward neural network training model is created using a resilient back-propagation learning algorithm. A comparison of feature set selection based on the decision tree, principal component analysis and the ReliefF algorithm indicated that the neural network model with feature selection based on the decision tree has better identification accuracy along with lower rates of false positives. The usefulness of the proposed approach is demonstrated by conducting experiments on real network traffic datasets. In these experiments, an average detection rate of 99.08 % with a false positive rate of 0.75 % was observed.

Northumbria Research Link

Crossref

Springer - Publisher Connector

Teeside University's Research Repository

A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data

Author: A Blum
A Tsymbal
Albert Y Zomaya
B Liu
Bing B Zhou
C Ding
C Ooi
D Ruta
G Bontempi
I Inza
IH Witten
J Hua
J Liu
JR Quinlan
L Lam
L Li
M Hassan
M Kudo
M Robnik-Šikonja
P Jafari
Pengyi Yang
R Kohavi
RL Somorjai
S Armstrong
S Dudoit
T Golub
T Jirapech-Umpai
T Mitchell
TG Dietterich
U Alon
W Li
X Chen
Y Saeys
Y Su
Y Wang
YH Yang
Z Zhang
Zili Zhang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Feature selection techniques are critical to the analysis of high dimensional datasets. This is especially true in gene selection from microarray data, which commonly have an extremely high feature-to-sample ratio. In addition to the essential objectives such as to reduce data noise, to reduce data redundancy, to improve sample classification accuracy, and to improve model generalization property, feature selection also helps biologists to focus on the selected genes to further validate their biological hypotheses. Results: In this paper we describe an improved hybrid system for gene selection. It is based on a recently proposed genetic ensemble (GE) system. To enhance the generalization property of the selected genes or gene subsets and to overcome the overfitting problem of the GE system, we devised a mapping strategy to fuse the goodness information of each gene provided by multiple filtering algorithms. This information is then used for initialization and mutation operation of the genetic ensemble system. Conclusion: We used four benchmark microarray datasets (including both binary-class and multi-class classification problems) for concept proving and model evaluation. The experimental results indicate that the proposed multi-filter enhanced genetic ensemble (MF-GE) system is able to improve sample classification accuracy, generate more compact gene subsets, and converge to the selection results more quickly. The MF-GE system is very flexible as various combinations of multiple filters and classifiers can be incorporated based on the data characteristics and the user preferences.

Deakin Research Online

Crossref

Springer - Publisher Connector

PubMed Central

A Markov blanket-based method for detecting causal SNPs in GWAS

Author: A Hamosh
BA McKinney
Bing Han
C Kooperberg
C-c Chang
CF Aliferis
D Koller
D Margaritis
DF Easton
HJ Cordell
I Tsamardinos
J Fellay
J Li
J Marchini
JH McDonald
JH Moore
JK Pritchard
LW Hahn
M Robnik-Šikonja
MD Ritchie
MD Shriver
Meeyoung Park
MY Park
P Spirtes
R Jiang
RJ Klein
RR Sokal
SE Antonarakis
SH Chen
SK Musani
ST Sherry
X-W Chen
Xue-wen Chen
Y Zhang
Publication venue: BioMed Central
Publication date: 01/04/2010

Background: Detecting epistatic interactions associated with complex and common diseases can help to improve prevention, diagnosis and treatment of these diseases. With the development of genome-wide association studies (GWAS), designing powerful and robust computational methods for identifying epistatic interactions associated with common diseases becomes a great challenge to the bioinformatics community, because the study of epistatic interactions often deals with the large size of the genotyped data and the huge number of combinations of all the possible genetic factors. Most existing computational detection methods are based on the classification capacity of SNP sets, which may fail to identify SNP sets that are strongly associated with the diseases and introduce a lot of false positives. In addition, most methods are not suitable for genome-wide scale studies due to their computational complexity. Results: We propose a new Markov Blanket-based method, DASSO-MB (Detection of ASSOciations using Markov Blanket), to detect epistatic interactions in case-control GWAS. The Markov blanket of a target variable T can completely shield T from all other variables. Thus, we can guarantee that the SNP set detected by DASSO-MB has a strong association with diseases and contains the fewest false positives. Furthermore, DASSO-MB uses a heuristic search strategy by calculating the association between variables to avoid the time-consuming training process as in other machine-learning methods. We apply our algorithm to simulated datasets and a real case-control dataset. We compare DASSO-MB to other commonly-used methods and show that our method significantly outperforms other methods and is capable of finding SNPs strongly associated with diseases. Conclusions: Our study shows that DASSO-MB can identify a minimal set of causal SNPs associated with diseases, which contains fewer false positives compared to other existing methods. Given the huge size of the genomic datasets produced by GWAS, this is critical in saving the potential costs of biological experiments and being an efficient guideline for pathogenesis research.

Crossref

Directory of Open Access Journals

KU ScholarWorks

PubMed Central